Abstract. We reconsider the stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization. We observe that if the learning rate is inversely proportional to the number of steps, i.e., the number of times any training pattern is presented to the algorithm, the update rule may be transformed into the one of the classical perceptron with margin in which the margin threshold increases linearly with the number of steps. Moreover, if we cycle repeatedly through the possibly randomly permuted training set, the dual variables, defined naturally via the expansion of the weight vector as a linear combination of the patterns on which margin errors were made, are shown to obey automatically at the end of each complete cycle the box constraints arising in dual optimization. This renders the dual Lagrangian a running lower bound on the primal objective, tending to it at the optimum, and makes available an upper bound on the relative accuracy achieved, which provides a meaningful stopping criterion. In addition, we propose a mechanism of presenting the same pattern repeatedly to the algorithm which maintains the above properties. Finally, we give experimental evidence that algorithms constructed along these lines exhibit a considerably improved performance.

1 Introduction

Support Vector Machines (SVMs) [1, 17, 4] have been extensively used as linear classifiers either in the space where the patterns originally reside or in high dimensional feature spaces induced by kernels. They appear to be very successful at addressing the classification problem expressed as the minimization of an objective function involving the empirical risk while at the same time keeping low the complexity of the classifier. As measures of the empirical risk various quantities have been proposed, with the 1- and 2-norm loss functions being the most widely accepted ones, giving rise to the optimization problems known as L1- and L2-SVMs [3]. SVMs typically treat the problem as a constrained quadratic optimization in the dual space. At the early stages of SVM development their efficient implementation was hindered by the quadratic dependence of their memory requirements on the number of training examples, a fact which rendered prohibitive the processing of large datasets. The idea of applying optimization only to a subset of the training set in order to overcome this difficulty resulted in the development of decomposition methods [13, 8]. Although such methods led to improved convergence rates, in practice their superlinear dependence on the number of examples, which can be even cubic, can still lead to excessive runtimes when dealing with massive datasets. Recently, the so-called linear SVMs [9, 6, 7, 11], taking advantage of linear kernels in order to allow parts of them to be written in primal notation, succeeded in outperforming decomposition SVMs.

The above considerations motivated research in alternative algorithms natu- 
rally formulated in primal space long before the advent of linear SVMs mostly 
in connection with the large margin classification of linearly separable datasets 
a problem directly related to the L2-SVM. Indeed, in the case that the 2-norm 
loss takes the place of the empirical risk an equivalent formulation exists which 
renders the dataset linearly separable in a high dimensional feature space. Such 
alternative algorithms ([12] and references therein) are mostly based on the perceptron [14], the simplest online learning algorithm for binary linear classification, with their key characteristic being that they work in the primal space in
an online manner, i.e., processing one example at a time. Cycling repeatedly 
through the patterns they update their internal state stored in the weight vector 
each time an appropriate condition is satisfied. This way, due to their ability 
to process one example at a time, such algorithms succeed in sparing time and 
memory resources and consequently become able to handle large datasets. 

Since the L1-SVM problem is not known to admit an equivalent maximum margin interpretation via a mapping to an appropriate space, fully primal large margin perceptron-like algorithms appear unable to deal with such a task.^1 Nevertheless, a somewhat different approach giving rise to online algorithms was developed which focuses on the minimization of the regularized 1-norm soft margin loss through stochastic gradient descent (SGD). Notable representatives of this approach are the pioneer NORMA [10] (see also [18]) and Pegasos [15, 16]. SGD gives rise to a kind of perceptron-like update having as an important ingredient the "shrinking" of the current weight vector. Shrinking always takes place when a pattern is presented to the algorithm, and it is the only modification suffered by the weight vector if no loss is incurred. Thus, due to the lack of a meaningful stopping criterion, the algorithm keeps running forever unless the user intervenes. In that sense the algorithms in question are fundamentally different from the mistake-driven large margin perceptron-like classifiers which terminate after a finite number of updates. There is no proof even for their asymptotic convergence when they use as output the final hypothesis, but there do exist probabilistic convergence results or results in terms of the average hypothesis.

In the present work we reconsider the straightforward version of SGD for the primal unconstrained L1-SVM problem assuming a learning rate inversely proportional to the number of steps. Therefore, such an algorithm can be regarded either as NORMA with a specific dependence of the learning rate on the number of steps or as Pegasos with no projection step in the update and with a single example contributing to the (sub)gradient ($k = 1$). We observe here that this algorithm may be transformed into a classical perceptron with margin [5] in which the margin threshold increases linearly with the number of steps. The obvious gain from this observation is that the shrinking of the weight vector at each step amounts to nothing but an increase of the step counter by one unit instead of the costly multiplication of all the components of the generally non-sparse weight vector with a scalar. Another benefit arising from the above simplified description is that we are able to demonstrate easily that if we cycle through the data in complete epochs the dual variables, defined naturally via the expansion of the weight vector as a linear combination of the patterns on which margin errors were made, satisfy automatically the box constraints of the dual optimization. An important consequence of this unexpected result is that the relevant dual Lagrangian, which is expressed in terms of the total number of margin errors, the number of complete epochs and the length of the current weight vector, provides during the run a lower bound on the primal objective function and gives us a measure of the progress made in the optimization process. Indeed, by virtue of the strong duality theorem the dual Lagrangian and the primal objective coincide at optimality. Therefore, assuming convergence to the optimum, an upper bound on the relative accuracy involving the dual Lagrangian may be defined which offers a useful and practically achievable stopping criterion. Moreover, we may now provide evidence in favor of the asymptotic convergence to the optimum by testing experimentally the vanishing of the duality gap. Finally, aiming at performing more updates at the expense of only one costly inner product calculation, we propose a mechanism of presenting the same pattern repeatedly to the algorithm consistently with the above interesting properties.

^1 The Margin Perceptron with Unlearning (MPU) [11] addresses the L1-SVM problem by keeping track of the number of updates caused by each pattern in parallel with the weight vector which is updated according to a perceptron-like rule. In that sense MPU uses dual variables and should rather be considered a linear SVM which, however, possesses a finite time bound for achieving a predefined relative accuracy.

The paper is organized as follows. Section 2 describes the algorithm and 
its properties. In Section 3 we give implementational details and deliver our 
experimental results. Finally, Section 4 contains our conclusions. 



2 The Algorithm and its Properties 



Assume we are given a training set $\{(x_k, l_k)\}_{k=1}^{m}$, with vectors $x_k \in \mathbb{R}^d$ and labels $l_k \in \{+1, -1\}$. This set may be either the original dataset or the result of a mapping into a feature space of higher dimensionality [17, 4]. By placing $x_k$ in the same position at a distance $\rho$ in an additional dimension, i.e., by extending $x_k$ to $[x_k, \rho]$, we construct an embedding of our data into the so-called augmented space [5]. The advantage of this embedding is that the linear hypothesis in the augmented space becomes homogeneous. Following the augmentation, a reflection with respect to the origin of the negatively labeled patterns is performed, allowing for a uniform treatment of both categories of patterns. We define $R \equiv \max_k \|y_k\|$ with $y_k \equiv [l_k x_k, l_k \rho]$ the $k$-th augmented and reflected pattern.
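The augmentation and reflection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `augment_and_reflect`, the use of NumPy and the toy data are ours and do not come from the paper.

```python
import numpy as np

def augment_and_reflect(X, labels, rho=1.0):
    """Return the matrix whose k-th row is y_k = [l_k x_k, l_k rho].

    X is an (m, d) array of patterns and labels holds +1/-1 entries.
    The name and signature are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Append the constant coordinate rho to every pattern (augmentation).
    augmented = np.hstack([X, np.full((X.shape[0], 1), rho)])
    # Multiply each row by its label (reflection of the negative class).
    return labels[:, None] * augmented

X = np.array([[1.0, 2.0], [3.0, -1.0]])
labels = np.array([1, -1])
Y = augment_and_reflect(X, labels, rho=0.5)
R = np.linalg.norm(Y, axis=1).max()  # R = max_k ||y_k||
```

After this step a pattern is correctly classified with margin exactly when `w @ Y[k]` is positive, whatever its original label.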



Let us consider the regularized empirical risk

$$\frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{k=1}^{m}\max\{0, 1 - w\cdot y_k\}$$

involving the 1-norm soft margin loss $\max\{0, 1 - w\cdot y_k\}$ for the pattern $y_k$ and the regularization parameter $\lambda$ controlling the complexity of the classifier $w$. For a given dataset of size $m$ minimization of the regularized empirical risk with respect to $w$ is equivalent to the minimization of the objective function

$$\mathcal{J}(w, C) \equiv \frac{1}{2}\|w\|^2 + C\sum_{k=1}^{m}\max\{0, 1 - w\cdot y_k\},$$

where the "penalty" parameter $C > 0$ is related to $\lambda$ as

$$C = \frac{1}{\lambda m}.$$

This is the L1-SVM problem expressed as an unconstrained optimization.
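To make the relation between the two formulations concrete, the following sketch evaluates both the regularized empirical risk and the objective with penalty parameter $C$ on a toy set of augmented and reflected patterns. The function names and the data are illustrative assumptions; the point is only that the two quantities differ by the constant factor $\lambda$ when $C = 1/(\lambda m)$.

```python
import numpy as np

def primal_objective(w, Y, C):
    """J(w, C) = 0.5 * ||w||^2 + C * sum_k max(0, 1 - w . y_k)."""
    hinge = np.maximum(0.0, 1.0 - Y @ w)
    return 0.5 * float(w @ w) + C * float(hinge.sum())

def regularized_risk(w, Y, lam):
    """(lam / 2) * ||w||^2 + (1 / m) * sum_k max(0, 1 - w . y_k)."""
    hinge = np.maximum(0.0, 1.0 - Y @ w)
    return 0.5 * lam * float(w @ w) + float(hinge.mean())

# Toy patterns y_k (rows), already augmented and reflected.
Y = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
w = np.array([0.5, 0.25])
C = 0.1
lam = 1.0 / (C * len(Y))  # C = 1 / (lam * m)

J = primal_objective(w, Y, C)
risk = regularized_risk(w, Y, lam)
# Dividing the regularized risk by lam recovers J, so both have the same minimizer.
```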

The algorithms we are concerned with are classical SGD algorithms. The term stochastic refers to the fact that they perform gradient descent with respect to the objective function in which the empirical risk $(1/m)\sum_{k=1}^{m}\max\{0, 1 - w\cdot y_k\}$ is approximated by the instantaneous risk $\max\{0, 1 - w\cdot y_k\}$ on a single example. The general form of the update rule is then

$$w_{t+1} = w_t - \eta_t \nabla_{w_t}\left[\frac{1}{2}\|w_t\|^2 + \frac{1}{\lambda}\max\{0, 1 - w_t\cdot y_k\}\right],$$

where $\eta_t$ is the learning rate and $\nabla_{w_t}$ stands for a subgradient with respect to $w_t$ since the 1-norm soft margin loss is only piecewise differentiable ($t \ge 0$). We choose a learning rate $\eta_t = 1/(t+1)$ which satisfies the conditions $\sum_{t=0}^{\infty}\eta_t^2 < \infty$ and $\sum_{t=0}^{\infty}\eta_t = \infty$ usually imposed in the convergence analysis of stochastic approximations. Then, noticing that $w_t - \frac{1}{t+1}w_t = \frac{t}{t+1}w_t$, we obtain the update

$$w_{t+1} = \frac{t}{t+1}w_t + \frac{1}{\lambda(t+1)}y_k \qquad (1)$$

whenever

$$w_t\cdot y_k \le 1 \qquad (2)$$

and

$$w_{t+1} = \frac{t}{t+1}w_t \qquad (3)$$

otherwise. In deriving the above update rule we made the choice $w_t - \lambda^{-1}y_k$ for the subgradient at the point $w_t\cdot y_k = 1$ where the 1-norm soft margin loss is not differentiable. We assume that $w_0 = 0$. We see that if $w_t\cdot y_k > 1$ the update consists of a pure shrinking of the current weight vector by the factor $t/(t+1)$.



The update rule may be simplified considerably if we perform the change of variable

$$w_t = \frac{a_t}{\lambda t} \qquad (4)$$

for $t > 0$ and $w_0 = a_0 = 0$ for $t = 0$. In terms of the new weight vector $a_t$ the update rule becomes

$$a_{t+1} = a_t + y_k \qquad (5)$$

whenever

$$a_t\cdot y_k \le \lambda t \qquad (6)$$

and

$$a_{t+1} = a_t \qquad (7)$$

otherwise.^2 This is the update of the classical perceptron algorithm with margin in which, however, the margin threshold in condition (6) increases linearly with the number of presentations of patterns to the algorithm independent of whether they lead to a change in the weight vector $a_t$. Thus, $t$ counts the number of times any pattern is presented to the algorithm which corresponds to the number of updates (including the pure shrinkings (3)) of the weight vector $w_t$. Instead, the weight vector $a_t$ is updated only if (6) is satisfied meaning that a margin error is made on $y_k$.
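The equivalence between the shrinking form of the update and the perceptron-with-margin form can be checked numerically. The sketch below runs both on the same sequence of randomly generated patterns and compares the resulting weight vectors; the function names, the random data and the value of `lam` are assumptions made purely for illustration.

```python
import numpy as np

def run_w_form(Y, order, lam):
    """SGD with learning rate 1/(t+1): shrink w at every step, add y_k on a margin error."""
    w = np.zeros(Y.shape[1])
    for t, k in enumerate(order):
        y = Y[k]
        if w @ y <= 1.0:
            w = (t / (t + 1)) * w + y / (lam * (t + 1))
        else:
            w = (t / (t + 1)) * w  # pure shrinking
    return w

def run_a_form(Y, order, lam):
    """Perceptron with margin threshold lam * t acting on a_t = lam * t * w_t."""
    a = np.zeros(Y.shape[1])
    for t, k in enumerate(order):
        if a @ Y[k] <= lam * t:
            a = a + Y[k]
        # otherwise a is left untouched: only the step counter advances
    return a / (lam * len(order))

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 3))
order = rng.integers(0, 20, size=200)
lam = 0.5

w_direct = run_w_form(Y, order, lam)
w_from_a = run_a_form(Y, order, lam)
```

In the second form a step without a margin error costs nothing beyond the inner product, which is the computational gain the text refers to.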

In the original formulation of Pegasos [15] the update is completed with a projection step in order to enforce the bound $\|w_t\| \le 1/\sqrt{\lambda}$ which holds for the optimal solution. We show now that this is dynamically achieved to any desired accuracy after the elapse of sufficient time. In practice, however, it is in almost all cases achieved after one pass over the data.

Proposition 1. For $t > 0$ the norm of the weight vector $w_t$ is bounded from above as follows

$$\|w_t\| \le \frac{1}{\sqrt{\lambda}}\sqrt{1 + \frac{R^2 - \lambda}{\lambda t}}. \qquad (8)$$

Proof. From the update rule (5) taking into account condition (6) under which the update takes place we get

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \|y_k\|^2 + 2a_t\cdot y_k \le R^2 + 2\lambda t.$$

Obviously, this is trivially satisfied if (6) is violated and (7) holds. A repeated application of the above inequality with $a_0 = 0$ gives

$$\|a_t\|^2 \le R^2 t + 2\lambda\sum_{k=0}^{t-1}k = R^2 t + \lambda t(t-1) = (R^2 - \lambda)t + \lambda t^2,$$

from where using (4) and taking the square root we obtain (8). □

^2 For $t = 0$ (6) becomes $a_0\cdot y_k \le 0$ instead of $a_0\cdot y_k \le 1$ which is obtained from (2) with $w_0 = a_0$. Since both are satisfied with $a_0 = 0$, (6) may be used for all $t$.



Combining (8) with the initial choice $w_0 = 0$ we see that for all $t$ the weaker bound $\|w_t\| \le R/\lambda$ previously derived in [16] holds.

SGD gives naturally rise to online algorithms. Therefore, we may choose the examples to be presented to the algorithm at random. However, the L1-SVM optimization task is a batch learning problem which may be better tackled by online algorithms via the classical conversion of such algorithms to the batch setting. This is done by cycling repeatedly through the possibly randomly permuted training dataset and using the last hypothesis for prediction. This traditional procedure of presenting the training data to the algorithm in complete epochs has in our case, as we will see shortly, the additional advantage that there exists a lower bound on the optimal value of the objective function to be minimized which is expressed in terms of quantities available during the run. The existence of such a lower bound provides an estimate of the relative accuracy achieved by the algorithm.

The Stochastic Gradient Descent Algorithm with random selection of examples

    Input: A dataset S = (y_1, ..., y_k, ..., y_m)
           with augmentation and reflection assumed
    Fix: C, t_max
    Define: λ = 1/(Cm)
    Initialize: t = 0, a_0 = 0
    while t < t_max do
        Choose y_k from S randomly
        if a_t · y_k ≤ λt then
            a_{t+1} = a_t + y_k
        else
            a_{t+1} = a_t
        t ← t + 1
    w_t = a_t/(λt)
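The pseudocode of the algorithm with random selection of examples translates almost line by line into Python. The version below is a minimal sketch, not the authors' C++ implementation: the function name `sgd_r`, the `seed` argument and the NumPy random generator are assumptions introduced for illustration.

```python
import numpy as np

def sgd_r(Y, C, t_max, seed=0):
    """Perceptron-like SGD with margin threshold lam * t and random pattern selection.

    Y holds the augmented and reflected patterns y_k as rows.
    Returns w = a / (lam * t_max); t_max is assumed to be at least 1.
    """
    m = Y.shape[0]
    lam = 1.0 / (C * m)
    rng = np.random.default_rng(seed)
    a = np.zeros(Y.shape[1])
    for t in range(t_max):
        y = Y[rng.integers(m)]  # choose a pattern uniformly at random
        if a @ y <= lam * t:    # margin error: perceptron-like update
            a += y
        # otherwise only the step counter advances (implicit shrinking of w)
    return a / (lam * t_max)

rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 4))
w = sgd_r(Y, C=1.0, t_max=300, seed=3)
```

Note that the weight vector `w` is formed only once, at the end, so no per-step rescaling of a dense vector is ever performed.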

Proposition 2. Let us assume that at some stage the whole training set has been presented to the algorithm exactly $T$ times. Then, it holds that

$$\mathcal{J}_{opt}(C) \equiv \min_{w}\mathcal{J}(w, C) \ge \mathcal{L}_T \equiv C\frac{M}{T} - \frac{1}{2}\|w_T\|^2, \qquad (9)$$

where $M$ is the total number of margin errors made up to that stage and $w_T \equiv w_{(mT)}$ the weight vector at $t = mT$ with $m$ being the size of the training set.

Proof. Let $I_k^t$ denote the number of margin errors made on the pattern $y_k$ up to time $t$ such that $a_t = \sum_k I_k^t y_k$. Obviously, it holds that

$$0 \le I_k^{(mT)} \le T \qquad (10)$$

since $y_k$ up to time $t = mT$ has been presented to the algorithm exactly $T$ times. Then, taking into account (4) we see that at time $t$ the dual variable $\alpha_k^t$ associated with $y_k$ is $\alpha_k^t = I_k^t/(\lambda t)$ and consequently the dual variable $\alpha_k^{(mT)}$ after $T$ complete epochs is given by

$$\alpha_k^{(mT)} = \frac{I_k^{(mT)}}{\lambda m T} = C\frac{I_k^{(mT)}}{T}. \qquad (11)$$

With use of (10) we readily conclude that the dual variables after $T$ complete epochs automatically satisfy the box constraints

$$0 \le \alpha_k^{(mT)} \le C. \qquad (12)$$

From the weak duality theorem it follows that

$$\mathcal{J}(w, C) \ge \mathcal{L}(\alpha) = \sum_k \alpha_k - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i\cdot y_j,$$

where $\mathcal{L}(\alpha)$ is the dual Lagrangian^3 and the variables $\alpha_k$ obey the box constraints $0 \le \alpha_k \le C$. Thus, setting $\alpha_k = \alpha_k^{(mT)}$ we get

$$\mathcal{J}_{opt}(C) \ge \sum_k \alpha_k^{(mT)} - \frac{1}{2}\sum_{i,j}\alpha_i^{(mT)}\alpha_j^{(mT)}\, y_i\cdot y_j.$$

Substituting $w_T = w_{(mT)} = \sum_k \alpha_k^{(mT)} y_k$ in the above inequality and noticing that $\sum_k \alpha_k^{(mT)} = (C/T)\sum_k I_k^{(mT)} = CM/T$ we obtain (9). □

In the course of proving Proposition 2 we saw that although the algorithm is fully primal the dual variables $\alpha_k^t$, defined through the expansion $w_t = \sum_k \alpha_k^t y_k$ of the weight vector $w_t$ as a linear combination of the patterns on which margin errors were made, obey after $T$ complete epochs automatically the box constraints (12) encountered in dual optimization.^4 This surprising result allows us to construct the dual Lagrangian $\mathcal{L}_T$ which provides a lower bound on the optimal value $\mathcal{J}_{opt}$ of the objective $\mathcal{J}$ and, assuming $\mathcal{L}_T > 0$, to obtain an upper bound $\mathcal{J}/\mathcal{L}_T - 1$ on the relative accuracy $\mathcal{J}/\mathcal{J}_{opt} - 1$ achieved as the algorithm keeps running. Thus, we have for the first time a primal SGD algorithm which may use the relative accuracy as stopping criterion.^5 It is also worth noticing that $\mathcal{L}_T$ involves only the total number $M$ of margin errors and does not require that we keep the values of the individual dual variables during the run.

Although the automatic satisfaction of the box constraints by the dual variables is very important it is by no means sufficient to ensure vanishing of the duality gap and consequently convergence to the optimal solution. To demonstrate convergence to the optimum relying on dual optimization theory we must make sure that the Karush-Kuhn-Tucker (KKT) conditions [17, 4] are satisfied. Their approximate satisfaction demands that the only patterns which have a substantial loss be the ones which have dual variables equal or at least extremely close to $C$ (bound support vectors) and moreover that the patterns which have zero loss and margin considerably larger than $1/\|w_T\|$ should have vanishingly small dual variables. Patterns with margin very close to $1/\|w_T\|$ may have dual variables with values between 0 and $C$ and play the role of the non-bound support vectors. From (11) we see that the dual variable associated with the $k$-th pattern is equal to $C\tau_k/T$ where $\tau_k \equiv I_k^{(mT)}$ is the number of epochs for which the $k$-th pattern was found to be a margin error. It is apparent that if there exists a number of epochs, no matter how large it may be, after which a pattern is consistently found to be a margin error then in the limit $T \to \infty$ we will have $(\tau_k/T) \to 1$ and the dual variable associated with it will asymptotically approach $C$. In contrast, if a pattern after a specific number of epochs is never found to be a margin error then $(\tau_k/T) \to 0$ and its dual variable will tend asymptotically to zero reflecting the accumulated effect of the shrinking that the weight vector suffers each time a pattern is presented to the algorithm. Therefore, the algorithm has the necessary ingredients for asymptotic satisfaction of the KKT conditions for the vanishing of the duality gap. The potential danger remains, however, that there may exist patterns with margin not very close to $1/\|w_T\|$ which do not belong to any of the above categories and occasionally either become margin errors although most of the time they are not or become classified with sufficiently large margin despite the fact that they are most of the time margin errors. The hope is that with time the changes in the weight vector $w_t$ will become smaller and smaller and such events will become more and more rare leading eventually to convergence to the optimal solution.

^3 Maximization of $\mathcal{L}(\alpha)$ subject to the constraints $0 \le \alpha_k \le C$ is the dual of the primal L1-SVM problem expressed as a constrained minimization.

^4 We expect that the dual variables will also satisfy the box constraints in the limit $t \to \infty$ if the patterns presented to the algorithm are selected randomly with equal probability since asymptotically they will all be selected an equal number of times.

^5 It is, of course, computationally expensive to evaluate at the end of each epoch the exact primal objective. Thus, an approximate calculation of the loss using the value that the weight vector had the last time each pattern was presented to the algorithm is preferable. This way we exploit the already computed inner product $a_t\cdot y_k$ which is needed in order to decide whether condition (6) is satisfied. If this approximate calculation gives a value of the relative accuracy which is not larger than $f$ times the one set as stopping criterion we proceed to a proper calculation of the primal objective. The comparison coefficient $f$ is given empirically a value close to 1.

The Stochastic Gradient Descent Algorithm with relative accuracy ε

    Input: A dataset S = (y_1, ..., y_k, ..., y_m)
           with augmentation and reflection assumed
    Fix: C, ε, f, T_max
    Define: q_k = ||y_k||², λ = 1/(Cm), ε' = fε
    Initialize: t = 0, T = 0, M = 0, r_0 = 0, a_0 = 0
    while T < T_max do
        if T ≠ 0 then
            θ = λt
            w2 = r_t/(2θ²)
            J = w2 + CL
            𝓛 = CM/T − w2
            if J − 𝓛 ≤ ε'𝓛 then
                L = 0
                for k = 1 to m do
                    p_k = a_t · y_k
                    if p_k ≤ θ then
                        L ← L + 1 − p_k/θ
                J = w2 + CL
                if J − 𝓛 ≤ ε𝓛 then
                    break
        Permute(S)
        L = 0
        for k = 1 to m do
            p_tk = a_t · y_k
            θ_t = λt
            if p_tk ≤ θ_t then
                a_{t+1} = a_t + y_k
                r_{t+1} = r_t + 2p_tk + q_k
                M ← M + 1
                if t > 0 then
                    L ← L + 1 − p_tk/θ_t
                else
                    L ← L + 1
            else
                a_{t+1} = a_t
                r_{t+1} = r_t
            t ← t + 1
        T ← T + 1
    w_t = a_t/(λt)
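The epoch-based algorithm and its stopping rule can be sketched as follows. This is a simplified illustration rather than the pseudocode itself: it recomputes the exact primal objective at the end of every epoch instead of first screening with the approximate loss and the comparison coefficient $f$, and the names `sgd_s`, `history` and `seed` are assumptions of ours.

```python
import numpy as np

def sgd_s(Y, C, eps, T_max, seed=0):
    """Epoch-wise SGD that stops when J / L_T - 1 <= eps (T_max >= 1 assumed).

    Returns the final weight vector and the list of (primal, dual) pairs,
    one pair per completed epoch.
    """
    m = Y.shape[0]
    lam = 1.0 / (C * m)
    rng = np.random.default_rng(seed)
    a = np.zeros(Y.shape[1])
    t = 0  # number of pattern presentations so far
    M = 0  # total number of margin errors so far
    history = []
    for T in range(1, T_max + 1):
        for k in rng.permutation(m):  # each pattern exactly once per epoch
            if a @ Y[k] <= lam * t:
                a += Y[k]
                M += 1
            t += 1
        w = a / (lam * t)
        half_norm = 0.5 * float(w @ w)
        primal = half_norm + C * float(np.maximum(0.0, 1.0 - Y @ w).sum())
        dual = C * M / T - half_norm  # lower bound on the optimal objective
        history.append((primal, dual))
        if dual > 0 and primal - dual <= eps * dual:
            break
    return w, history

rng = np.random.default_rng(2)
Y = rng.normal(size=(40, 3))
w, history = sgd_s(Y, C=0.5, eps=0.05, T_max=50, seed=1)
```

Because every pattern is presented exactly once per epoch, the dual value recorded in `history` never exceeds the primal value, which is what makes the ratio usable as a stopping criterion.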

The above discussion cannot be regarded as a formal proof of the asymptotic convergence of the algorithm. We believe, however, that it does provide a convincing argument that assuming convergence (not necessarily to the optimum) the duality gap will eventually tend to zero and the lower bound $\mathcal{L}_T$ on the primal objective $\mathcal{J}$ given in Proposition 2 will approach the optimal primal objective $\mathcal{J}_{opt}$, thereby proving that convergence to the optimum has been achieved. If, instead, we make the stronger assumption of convergence to the optimum then, of course, the vanishing of the duality gap follows from the strong duality theorem. In any case the stopping criterion exploiting the upper bound $\mathcal{J}/\mathcal{L}_T - 1$ on the relative accuracy $\mathcal{J}/\mathcal{J}_{opt} - 1$ is a meaningful one.

Our discussion so far assumes that in an epoch each pattern is presented only once to the algorithm. We may, however, consider the option of presenting the same pattern $y_k$ repeatedly to the algorithm aiming at performing more updates at the expense of only one calculation of the costly inner product $a_t\cdot y_k$. Proposition 2 and the analysis following it will still be valid on the condition that all patterns in each epoch are presented exactly the same number $\ell$ of times to the algorithm. Then, such an epoch should be regarded as equivalent to $\ell$ usual epochs with single presentations of patterns to the algorithm and will have as a result the increase of $t$ by an amount equal to $m\ell$.

It is, of course, important to be able to decide in terms of just the initial value of $a_t\cdot y_k$ how many, let us say $\ell_+$, out of these $\ell$ consecutive presentations of the pattern $y_k$ to the algorithm will lead to a margin error, i.e., to an update of $a_t$, with each of the remaining $\ell_- = \ell - \ell_+$ presentations necessarily corresponding to just an increase of $t$ by 1 which amounts to a pure shrinking of $w_t$.

Proposition 3. Let the pattern $y_k$ be presented at time $t$ repeatedly $\ell$ times to the algorithm. Also let

$$P = a_t\cdot y_k - \lambda t.$$

Then, the number $\ell_+$ of times that $y_k$ will be found to be a margin error is given by the following formula

$$\ell_+ = \begin{cases} 0 & \text{if } P > (\ell-1)\lambda \\ \min\left\{\ell, \left[\dfrac{(\ell-1)\lambda - P}{\max\{\|y_k\|^2, \lambda\}}\right] + 1\right\} & \text{if } P \le (\ell-1)\lambda. \end{cases} \qquad (13)$$

Here $[x]$ denotes the integer part of $x \ge 0$.

Proof. For the sake of brevity we call a plus-step a presentation of the pattern $y_k$ to the algorithm which leads to a margin error and a minus-step a presentation which does not. If at time $t$ a plus-step takes place $a_{t+1}\cdot y_k - \lambda(t+1) = (a_t\cdot y_k - \lambda t) + (\|y_k\|^2 - \lambda)$ while if a minus-step takes place $a_{t+1}\cdot y_k - \lambda(t+1) = (a_t\cdot y_k - \lambda t) - \lambda$. Thus, a plus-step adds to $P$ the quantity $\|y_k\|^2 - \lambda$ while a minus-step the quantity $-\lambda$. Clearly, after $\ell$ consecutive presentations of $y_k$ to the algorithm it holds that $a_{t+\ell}\cdot y_k - \lambda(t+\ell) = P + \ell_+(\|y_k\|^2 - \lambda) - (\ell - \ell_+)\lambda$.

If $P > (\ell-1)\lambda$ it follows that $P - (\ell-1)\lambda > 0$ which means that after $\ell - 1$ consecutive minus-steps condition (6) is still violated and an additional minus-step must take place. Thus, $\ell_- = \ell$ and $\ell_+ = 0$.

For $P \le (\ell-1)\lambda$ we first treat the subcase $\max\{\|y_k\|^2, \lambda\} = \lambda$. If $\|y_k\|^2 \le \lambda$ and $P \le 0$ condition (6) is initially satisfied and will still be satisfied after any number of plus-steps since the quantity $\|y_k\|^2 - \lambda$ that is added to $P$ with a plus-step is non-positive. Thus, $\ell_+ = \ell$. This is in accordance with (13) since $((\ell-1)\lambda - P)/\lambda \ge \ell - 1$ or $[((\ell-1)\lambda - P)/\lambda] + 1 \ge \ell$ leading to $\ell_+ = \ell$. It remains for $\|y_k\|^2 \le \lambda$ to consider $P$ in the interval $0 < P \le (\ell-1)\lambda$ which can be further subdivided as $(\ell_1 - 1)\lambda < P \le \ell_1\lambda$ with the integer $\ell_1$ satisfying $1 \le \ell_1 \le \ell - 1$. For $P$ belonging to such a subinterval condition (6) is initially violated and will still be violated after $\ell_1 - 1$ minus-steps while after one more minus-step it will be satisfied. It will still be satisfied after any number of additional plus-steps because the quantity $\|y_k\|^2 - \lambda$ that is added to $P$ with a plus-step is non-positive. Thus, $\ell_- = \ell_1$ and $\ell_+ = \ell - \ell_1$. This is in accordance with (13) since $(\ell - \ell_1 - 1)\lambda \le (\ell-1)\lambda - P < (\ell - \ell_1)\lambda$ leading to $[((\ell-1)\lambda - P)/\lambda] + 1 = \ell - \ell_1$.

The subcase $\|y_k\|^2 > \lambda$ of the case $P \le (\ell-1)\lambda$ is far more complicated. If $\|y_k\|^2 > \lambda$ with $P \le -(\ell-1)(\|y_k\|^2 - \lambda)$ condition (6) is initially satisfied and will still be satisfied after $\ell - 1$ plus-steps since $P + (\ell-1)(\|y_k\|^2 - \lambda) \le 0$. Thus, $\ell_+ = \ell$. This is consistent with (13) because $(\ell-1)\lambda - P \ge (\ell-1)\|y_k\|^2$ or $[((\ell-1)\lambda - P)/\|y_k\|^2] + 1 \ge \ell$ leading to $\ell_+ = \ell$. It remains to examine the case $\|y_k\|^2 > \lambda$ with $P$ in the interval $-(\ell-1)(\|y_k\|^2 - \lambda) < P \le (\ell-1)\lambda$. The above interval can be expressed as a union of subintervals $(\ell - \ell_1 - 1)\lambda - \ell_1(\|y_k\|^2 - \lambda) < P \le (\ell - \ell_1)\lambda - (\ell_1 - 1)(\|y_k\|^2 - \lambda)$ with the integer $\ell_1$ satisfying $1 \le \ell_1 \le \ell - 1$. Let $P$ belong to such a subinterval. Let us also assume that the pattern $y_k$ has been presented $\kappa < \ell$ consecutive times to the algorithm as a result of which $\kappa_+$ plus-steps and $\kappa_-$ minus-steps have taken place and the quantity $\kappa_+(\|y_k\|^2 - \lambda) - \kappa_-\lambda$ has been added to $P$. Then $P_{\kappa_+,\kappa_-} = P + \kappa_+(\|y_k\|^2 - \lambda) - \kappa_-\lambda$ satisfies $(\ell - \ell_1 - 1 - \kappa_-)\lambda - (\ell_1 - \kappa_+)(\|y_k\|^2 - \lambda) < P_{\kappa_+,\kappa_-} \le (\ell - \ell_1 - \kappa_-)\lambda - (\ell_1 - 1 - \kappa_+)(\|y_k\|^2 - \lambda)$. As $\kappa$ increases either $\kappa_+$ will first reach the value $\ell_1$ with $\kappa_- < \ell - \ell_1$ or $\kappa_-$ will first reach the value $\ell - \ell_1$ with $\kappa_+ < \ell_1$. In the former case $0 \le (\ell - \ell_1 - 1 - \kappa_-)\lambda < P_{\kappa_+,\kappa_-}$. This means that condition (6) is violated and will continue being violated until the number of minus-steps becomes equal to $\ell - \ell_1 - 1$ in which case one more minus-step must take place. Thus, all steps taking place after $\kappa_+$ has reached the value $\ell_1$ are minus-steps. In the latter case $P_{\kappa_+,\kappa_-} \le -(\ell_1 - 1 - \kappa_+)(\|y_k\|^2 - \lambda) \le 0$. This means that condition (6) is satisfied and will continue being satisfied until the number of plus-steps becomes equal to $\ell_1 - 1$ in which case one more plus-step must take place. Thus, all steps taking place after $\kappa_-$ has reached the value $\ell - \ell_1$ are plus-steps. In both cases $\ell_+ = \ell_1$. This is again in accordance with (13) because $(\ell_1 - 1)\|y_k\|^2 \le (\ell-1)\lambda - P < \ell_1\|y_k\|^2$ or $[((\ell-1)\lambda - P)/\|y_k\|^2] + 1 = \ell_1$. □

With $\ell_+$ given in Proposition 3 the update of multiplicity $\ell$ of the weight vector $a_t$ is written formally as

$$a_{t+\ell} = a_t + \ell_+ y_k. \qquad (14)$$
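The closed-form count of margin errors under repeated presentation can be compared against a direct step-by-step simulation. The sketch below does this with exact rational arithmetic so that boundary cases are not blurred by rounding; the function names and the sample values are illustrative assumptions, and `ell` plays the role of the multiplicity of the update.

```python
import math
from fractions import Fraction

def plus_steps_formula(P, y_sq, lam, ell):
    """Closed-form number of margin errors among ell consecutive presentations.

    P is a_t . y_k - lam * t at the first presentation and y_sq is ||y_k||^2.
    """
    if P > (ell - 1) * lam:
        return 0
    return min(ell, math.floor(((ell - 1) * lam - P) / max(y_sq, lam)) + 1)

def plus_steps_simulated(P, y_sq, lam, ell):
    """Count margin errors by replaying the ell presentations one at a time."""
    count = 0
    for _ in range(ell):
        if P <= 0:          # margin condition satisfied: plus-step
            P += y_sq - lam
            count += 1
        else:               # minus-step: only the threshold grows
            P -= lam
    return count

P, y_sq, lam = Fraction(-19, 10), Fraction(3), Fraction(1)
print(plus_steps_formula(P, y_sq, lam, 2), plus_steps_simulated(P, y_sq, lam, 2))
# prints: 1 1
```

In an implementation only the closed form would be used, since it needs the single inner product that is already available from the margin test.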



3 Implementation and Experiments 

We implement three types of SGD algorithm^ along the lines of the previous 
section. The first is the plain algorithm with random selection of examples, 
denoted SGD-r, which terminates when the maximum number imax of steps is 
reached. Its pseudocode is given in Section 2. The dual variables in this case 
do not satisfy the box constraints as a result of which relative accuracy cannot 
be used as stopping criterion. The SGD algorithm with relative accuracy e, 
the pseudocode of which is also given in Section 2, is denoted SGD-s where 
s designates that in an epoch each pattern is presented a single time to the 
algorithm. It terminates when the relative deviation of the primal objective J 
from the dual Lagrangian C?^ just falls below e provided the maximum number 
Tmax of full epochs is not exhausted. A variation of this algorithm, denoted 
SGD-m, replaces the usual update with the multiple update (1141 which amounts 
to multiple consecutive presentations of each pattern to the algorithm in an 
epoch. We found it advantageous not to perform multiple updates in every epoch. 
The policy we adopt is to perform multiple updates in the T-th epoch only if 
< T mod 9 < 5. When multiple updates are performed their multiplicity i is 
chosen to be ^ = 5. For both SGD-s and SGD-m the comparison coefficient / is 
given the value f — 1.2 unless otherwise explicitly stated. 

Algorithms performing SGD on the primal objective are expected to per- 
form better if linear kernels are employed. Therefore the feature space in our 
experiments will be chosen to be the original instance space. As a consequence, 
our algorithms should most naturally be compared with linear SVMs. Among 
them we choose SVMP'"^ |3 P; , the first cutting-plane algorithm for training lin- 
ear SVMs, the Optimized Cutting Plane Algorithm for SVMtH (OCAS) [6], the 
Dual Coordinate DescenIO (DCD) algorithm [7 and the Margin Perceptron with 
Unlearningij (MPU) JTj. We also include in our study Pegasoa^H (with fc = 1). 

The datasets we used for training are the binary Adult and Web datasets 
as compiled by Platil^n. the training set of the KDD04 Physics datasel|13 (with 
70 attributes after removing the 8 columns containing missing features), the 
Real-sim, News20 and Webspam (unigram treatment) datasetqlj, the multiclass 
Covertype UCI dataseij^ and the full Reuters RCVl dataselo. Their number of 
instances and attributes are listed in Table[T] In the case of the Covertype dataset 



Sources available at 'http : //users . auth . gr/costapan 

Source (version 2.50) available at http://svmlight.joachims.org 



Source (version 0.96) available at http : //cmp . f elk . cvut . cz/~ xf rancv/ocas/html' 
Source available at ht tp : //www . csie . ntu . edu . tw/~ c j lin/liblinear ^ We used the 
slightly faster older liblinear version 1.7 instead of the latest 1.93. 
Source available at http : //users . auth. gr/costapan" 



Source available at http : //ttic . uchicago . edu/~ shai/ code| 
http: //research. microsoft . com/en-us/pro jects/svm/ 
http : //osmot . cs . Cornell . edu/kddcup/datasets . html 
http : //www . csie . ntu . edu . tw/~ c j lin/libsvmtoo ls/datasets| 
[http: //archive . ics .uci .edu/ml/datasets .html 
^^ [http : //www . jmlr . org/papers/volume5/lewis04a/lyrl2004_rcvlv2_README . htm| 



Table 1. The number T of complete epochs required in order for the SGD-s algorithm 
to achieve (J - CT)IC7 < 10~^ for C = 0.1. 



data 
set 


^instances 


^attributes 


SGD-s e = 10-'^ C = 0.1 


T 


J 


C^ 


Adult 


32561 


123 


208174 


1149.904 


1149.893 


Web 


49749 


300 


16849 


755.1139 


755.1064 


Physics 


50000 


70 


13668 


4995.139 


4995.089 


Realsim 


72309 


20958 


4209 


1437.315 


1437.301 


News20 


19996 


1355191 


2178 


902.5611 


902.5521 


Webspam 


350000 


254 


27680 


8284.781 


8284.698 


Covertype 


581012 


54 


712648 


36427.52 


36427.16 


Cll 


804414 


47236 


5670 


5174.432 


5174.381 


CCAT 


804414 


47236 


7987 


12114.29 


12114.17 



we study the binary classification problem of the first class versus rest while for 
the RCVl we consider both the binary text classification tasks of the Cll and 
CCAT classes versus rest. The Physics and Covertype datasets were rescaled by 
multiplying all the features with 0.001. The experiments were conducted on a 2.5 
GHz Intel Core 2 Duo processor with 3 GB RAM running Windows Vista. Our 
codes written in C++ were compiled using the g++ compiler under Cygwin. 

First we perform an experiment aiming at demonstrating that our SGD algorithms
are able to obtain extremely accurate solutions. More specifically, with
the algorithm SGD-s employing single updating we attempt to diminish the gap
between the primal objective $\mathcal{J}$ and the dual Lagrangian $\mathcal{L}$
setting as a goal a relative deviation
$(\mathcal{J} - \mathcal{L})/\mathcal{L} < 10^{-5}$ for C = 0.1. In the present
and in all subsequent experiments we do not include a bias term in any of the
algorithms (i.e., in our case we assign to the augmentation parameter the value
$\rho = 0$). In order to keep the number T of complete epochs as low as possible
we increase the comparison coefficient $f$ until the number of epochs required
gets stabilized. This procedure does not entail, of course, the shortest training
time but this is not our concern in this experiment. In Table 1 we give the
values of both $\mathcal{J}$ and $\mathcal{L}$ and the number T of epochs needed
to achieve these values. If multiple updates are used a larger number of epochs
is, in general, required due to the slower increase of $\mathcal{L}$. Thus, SGD-s
achieves, in general, relative accuracy closer to $\epsilon$ than SGD-m does.
This is confirmed by subsequent experiments.
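
The relative deviation between the primal objective and the dual Lagrangian
discussed above can be sketched in a few lines of Python. This is only an
illustration and not the authors' C++ code. It assumes the standard L1-SVM forms
without bias, $\mathcal{J}(w) = \frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i\, w \cdot x_i)$
and $\mathcal{L}(\alpha) = \sum_i \alpha_i - \frac{1}{2}\|\sum_i \alpha_i y_i x_i\|^2$
with $0 \le \alpha_i \le C$; all function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def primal_objective(w, X, y, C):
    """Assumed primal form: 0.5*||w||^2 + C * (sum of hinge losses)."""
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge


def weight_from_dual(alpha, X, y):
    """Weight vector expanded as w = sum_i alpha_i * y_i * x_i."""
    dim = len(X[0])
    return [sum(a * yi * xi[j] for a, xi, yi in zip(alpha, X, y))
            for j in range(dim)]


def dual_lagrangian(alpha, X, y):
    """Assumed dual form: sum_i alpha_i - 0.5*||sum_i alpha_i y_i x_i||^2."""
    w = weight_from_dual(alpha, X, y)
    return sum(alpha) - 0.5 * dot(w, w)


def relative_gap(alpha, X, y, C):
    """(J - L)/L for box-feasible dual variables; bounds the relative accuracy."""
    if not all(0.0 <= a <= C for a in alpha):
        raise ValueError("dual variables must satisfy 0 <= alpha_i <= C")
    w = weight_from_dual(alpha, X, y)
    J = primal_objective(w, X, y, C)
    L = dual_lagrangian(alpha, X, y)
    if L <= 0.0:
        raise ValueError("dual Lagrangian must be positive for a relative gap")
    return (J - L) / L
```

A stopping rule of the kind described would then compare `relative_gap(...)`
with the requested accuracy. Because the dual Lagrangian never exceeds the
optimal primal value for feasible dual variables, a small gap also certifies a
small relative distance of the primal objective from its optimum.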

In our comparative experimental investigations we aim at achieving relative
accuracy $(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for various
values of the penalty parameter C assuming knowledge of the value of
$\mathcal{J}_{opt}$. For Pegasos and SGD-r we use as stopping criterion the
exhaustion of the maximum number of steps (iterations) $t_{max}$ which, however,
is given values which are multiples of the dataset size m. The ratio $t_{max}/m$
may be considered analogous to the number T of epochs of the algorithm SGD-s
since equal values of these two quantities indicate identical numbers of $w_t$
updates. The input parameter for SGD-s and SGD-m is the (upper bound on the)
relative accuracy $\epsilon$. For MPU we use the parameter
$\epsilon = \delta = \delta_{stop}$,



Table 2. Training times of SGD algorithms to achieve
$(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for C = 1.

C = 1                 Pegasos               SGD-r                    SGD-s                    SGD-m
data set   $t_{max}/m$      s  $t_{max}/m$      s  $\epsilon$     T      s  $\epsilon$     T      s
Adult              181    4.4          116   0.55       0.105   111   0.56        0.33    50   0.27
Web                 53    1.0           46   0.34       0.054    26   0.20         0.1    14   0.11
Physics              2   0.20            6   0.09         2.1     1   0.03        0.14     3   0.06
Realsim             66    3.4           70    2.0       0.046    20   0.58       0.061    16   0.47
News20              89   10.2           88    7.5       0.023    39    3.1       0.029    25    2.1
Webspam              8    3.4            9    2.1       0.068    14    3.0        0.21     5    1.2
Covertype            -      -           62   10.9       0.264    65    8.7        1.12    18    2.5
C11                 41   31.4           39   21.1        0.05    16    7.9       0.136     8    4.2
CCAT                37   32.7           36   19.8       0.055    16    8.1       0.163     7    4.1

Table 3. Training times of linear SVMs to achieve
$(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for C = 1.

C = 1           SVM$^{perf}$    OCAS                DCD                MPU
data set   $\epsilon$      s       s  $\epsilon$      s  $\epsilon$      s
Adult             0.7    1.5    0.08         2.8   0.16        0.02   0.09
Web               0.2   0.33    0.30           6   0.06        0.01   0.05
Physics           1.0   0.30    0.02          23   0.06        0.06   0.06
Realsim          0.08   0.80    0.62         0.7   0.22        0.06   0.23
News20           0.14   12.8     6.0         0.4   0.64        0.03    1.5
Webspam           0.5    7.3     4.2         2.5    1.4         0.1   0.98
Covertype         4.2   45.8     3.4         6.5    9.1         0.1    6.2
C11              0.09   12.8     9.0         1.4    3.5        0.09    2.5
CCAT             0.25   19.3    12.9         1.6    3.6         0.1    3.2

where $\delta$ is the before-run relative accuracy and $\delta_{stop}$ the
stopping threshold for the after-run relative accuracy. For SVM$^{perf}$ and DCD
we use as input their parameter $\epsilon$ while for OCAS the primal objective
value $q = 1.01\,\mathcal{J}_{opt}$ (not given in the tables) with the relative
tolerance taking the default value r = 0.01.
Any difference in training time between Pegasos and SGD-r for equal values
of $t_{max}/m$ should be attributed to the difference in the implementations. Any
difference between $t_{max}/m$ for SGD-r and T for SGD-s is to be attributed to the
different procedure of choosing the patterns that are presented to the algorithm.
Finally, the difference in the number T of epochs between SGD-s and SGD-m
reflects the effect of multiple updates. It should be noted that in the runtime of
SGD-s and SGD-m several calculations of the primal and the dual objective are
included which are required for checking the satisfaction of the stopping criterion.
If SGD-s and SGD-m were using the exhaustion of the maximum number $T_{max}$
of epochs as stopping criterion their runtimes would certainly be shorter.
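
The two ways of choosing the patterns mentioned above (uniformly at random at
every step versus complete epochs over a reshuffled training set) can be
sketched as follows. This is a plain Pegasos-style SGD without projection,
written under the assumption that the objective is
$\frac{\lambda}{2}\|w\|^2$ plus the mean hinge loss and that the learning rate
is $1/(\lambda t)$. It does not use the change of variable or the multiple
updating of the algorithms compared here, and all names are hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equally long sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def sgd_l1svm(X, y, lam, num_steps, pick):
    """Run num_steps SGD steps with learning rate 1/(lam*t).

    pick(t) returns the index of the pattern presented at step t.
    """
    w = [0.0] * len(X[0])
    for t in range(1, num_steps + 1):
        i = pick(t)
        xi, yi = X[i], y[i]
        shrink = 1.0 - 1.0 / t          # equals 1 - eta_t * lam
        eta = 1.0 / (lam * t)
        if yi * dot(w, xi) < 1.0:       # margin error: shrink and add pattern
            w = [shrink * wj + eta * yi * xj for wj, xj in zip(w, xi)]
        else:                           # no margin error: only shrink
            w = [shrink * wj for wj in w]
    return w


def random_picker(m, seed=0):
    """Pattern chosen uniformly at random at every step."""
    rng = random.Random(seed)
    return lambda t: rng.randrange(m)


def epoch_picker(m, seed=0):
    """Patterns presented in complete epochs, reshuffled before each epoch."""
    rng = random.Random(seed)
    order = list(range(m))

    def pick(t):
        pos = (t - 1) % m
        if pos == 0:
            rng.shuffle(order)
        return order[pos]

    return pick
```

With `num_steps = k * m` both variants perform the same number of steps, so the
ratio of steps to dataset size in the random variant plays the role of the
number of epochs in the other. Only the epoch variant guarantees that every
pattern is presented exactly `k` times.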

Tables 2 and 3 contain the results of the experiments involving the SGD
algorithms and linear SVMs for C = 1. We observe that, in general, there is a



Table 4. Training times of SGD algorithms to achieve
$(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for C = 10.

C = 10                Pegasos               SGD-r                    SGD-s                    SGD-m
data set   $t_{max}/m$      s  $t_{max}/m$      s  $\epsilon$     T      s  $\epsilon$     T      s
Adult                -      -         1146    5.3       0.098  1172    5.8        0.35   455    2.4
Web                338    6.5          330    2.4       0.049   220    1.7       0.105    99   0.76
Physics             12    1.1           51   0.78       0.203    17   0.27       0.223    19   0.33
Realsim            746   30.5          738   21.1      0.0162   487   12.9       0.027   261    7.0
News20             796   76.1          797   63.0      0.0104   719   53.0       0.012   279   20.6
Webspam              -      -           64   14.5       0.125    62   12.3         0.4    26    5.7
Covertype            -      -          472   82.8       0.343   462   60.2        0.77   269   36.3
C11                446  332.1          441  238.7      0.0415   178   82.0       0.085   116   53.9
CCAT                 -      -          387  212.2      0.0471   170   79.2       0.112    98   46.7

Table 5. Training times of linear SVMs to achieve
$(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for C = 10.

C = 10          SVM$^{perf}$    OCAS                DCD                MPU
data set   $\epsilon$      s       s  $\epsilon$      s  $\epsilon$      s
Adult             0.6   38.0    0.33         2.6    1.2        0.04   0.62
Web              0.23    1.4    0.55           8   0.20        0.02   0.08
Physics           1.3    2.9    0.09          23   0.30        0.05   0.20
Realsim         0.031    4.4     2.6        0.25   0.45        0.02   0.41
News20          0.019  147.0    40.7         0.2    1.5        0.02    2.1
Webspam          0.36   37.1     6.8         2.2    5.3         0.2    2.5
Covertype         2.1   52.0    17.3         6.1   90.8        0.08   38.5
C11             0.079   39.6    25.0        0.65    9.2        0.02    5.7
CCAT             0.14   72.0    35.2        0.85   11.7        0.02    7.9

progressive decrease in training time as we move from Pegasos to SGD-m through
SGD-r and SGD-s due to the additive effect of several factors. These factors are
the more efficient implementation of our algorithms exploiting the change of
variable given by (4), the presentation of the patterns to SGD-s and SGD-m in
complete epochs (see also [2,16]) and the use by SGD-m of multiple updating.
The overall improvement made by SGD-m over Pegasos is quite substantial.
DCD and MPU are certainly statistically faster but their differences from SGD-m
are not very large especially for the largest datasets. Moreover, SGD-s and
SGD-m are considerably faster than SVM$^{perf}$ and statistically faster than OCAS.
Pegasos failed to process the Covertype dataset due to numerical problems.
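
To put a number on the improvement of SGD-m over Pegasos, the ratio of the two
training times can be computed directly from the Pegasos and SGD-m columns of
Table 2 (C = 1). The short Python sketch below does only this arithmetic; the
variable names are illustrative, and Covertype is left out because Pegasos did
not produce a result for it.

```python
# Training times in seconds for C = 1, copied from Table 2:
# dataset -> (Pegasos, SGD-m).
times_c1 = {
    "Adult": (4.4, 0.27),
    "Web": (1.0, 0.11),
    "Physics": (0.20, 0.06),
    "Realsim": (3.4, 0.47),
    "News20": (10.2, 2.1),
    "Webspam": (3.4, 1.2),
    "C11": (31.4, 4.2),
    "CCAT": (32.7, 4.1),
}

# Speedup factor of SGD-m relative to Pegasos on each dataset.
speedups = {name: pegasos / sgd_m for name, (pegasos, sgd_m) in times_c1.items()}

for name, factor in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{name:<9} {factor:5.1f}x")
```

On these figures SGD-m is faster than Pegasos on every dataset for which both
finished, by a factor between roughly 3 and 16.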

Tables 4 and 5 contain the results of the experiments involving the SGD
algorithms and linear SVMs for C = 10. Although the general characteristics
resemble the ones of the previous case the differences are magnified due to the
intensity of the optimization task. Certainly, the training time of linear SVMs
scales much better as C increases. Moreover, MPU clearly outperforms DCD and
OCAS for most datasets. SGD-m is still statistically faster than SVM$^{perf}$ but
slower than OCAS. Finally, Pegasos runs more often into numerical problems.
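
The remark on scaling with C can be checked with simple arithmetic on the
reported times. The sketch below, with illustrative variable names, divides the
C = 10 training time by the C = 1 training time for SGD-m (Tables 2 and 4) and
for MPU (Tables 3 and 5), taken here as one representative of the linear SVMs.

```python
# Training times in seconds: dataset -> (time at C = 1, time at C = 10).
sgd_m = {
    "Adult": (0.27, 2.4), "Web": (0.11, 0.76), "Physics": (0.06, 0.33),
    "Realsim": (0.47, 7.0), "News20": (2.1, 20.6), "Webspam": (1.2, 5.7),
    "Covertype": (2.5, 36.3), "C11": (4.2, 53.9), "CCAT": (4.1, 46.7),
}
mpu = {
    "Adult": (0.09, 0.62), "Web": (0.05, 0.08), "Physics": (0.06, 0.20),
    "Realsim": (0.23, 0.41), "News20": (1.5, 2.1), "Webspam": (0.98, 2.5),
    "Covertype": (6.2, 38.5), "C11": (2.5, 5.7), "CCAT": (3.2, 7.9),
}

# Factor by which the training time grows when C goes from 1 to 10.
growth_sgdm = {name: t10 / t1 for name, (t1, t10) in sgd_m.items()}
growth_mpu = {name: t10 / t1 for name, (t1, t10) in mpu.items()}

for name in sgd_m:
    print(f"{name:<9} SGD-m x{growth_sgdm[name]:5.1f}   MPU x{growth_mpu[name]:5.1f}")
```

For every dataset in the tables the growth factor of SGD-m exceeds that of MPU,
which is the sense in which the linear SVM scales better as C increases.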



Table 6. Training times of SGD algorithms to achieve
$(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for C = 0.05.

C = 0.05              Pegasos               SGD-r                    SGD-s                    SGD-m
data set   $t_{max}/m$      s  $t_{max}/m$      s  $\epsilon$     T      s  $\epsilon$     T      s
Adult                7   0.17           12   0.06        0.07    10   0.05        0.15     6   0.03
Web                  4   0.11            4   0.03        0.06     2   0.02        0.06     2   0.02
Physics              1   0.09            1   0.02        0.11     1   0.02        0.06     1   0.02
Realsim              3   0.27            3   0.09        0.09     1   0.06        0.14     1   0.06
News20               4   0.72            3   0.33        0.08     1   0.20        0.12     1   0.20
Webspam              1   0.50            2   0.47         0.4     1   0.44        0.18     1   0.44
Covertype            5    4.1            4   0.72        0.25     4   0.64        0.27     5   0.80
C11                  2    1.6            3    1.6        0.03     2    1.5         0.1     1    1.0
CCAT                 2    2.1            2    1.1        0.12     1    1.0        0.12     1    1.0

Table 7. Training times of linear SVMs to achieve
$(\mathcal{J} - \mathcal{J}_{opt})/\mathcal{J}_{opt} < 0.01$ for C = 0.05.

C = 0.05        SVM$^{perf}$    OCAS                DCD                MPU
data set   $\epsilon$      s       s  $\epsilon$      s  $\epsilon$      s
Adult             1.1   0.14    0.05           3   0.03        0.01   0.03
Web               0.3   0.09    0.09           4   0.03        0.01   0.03
Physics           1.1   0.08    0.02          12   0.03        0.02   0.02
Realsim           0.7   0.23    0.20         0.7   0.16         0.2   0.16
News20            0.9    1.3    0.73           3   0.23         0.2   0.56
Webspam           1.1    2.8     2.3           1    1.1         0.2   0.72
Covertype         2.9   53.7     1.6           6    1.2         0.3    1.4
C11              0.11    4.7     2.9           4    1.4         0.1    1.4
CCAT             0.25    7.2     5.7           1    2.7         0.1    1.6


In contrast, as C decreases the differences among the algorithms are alleviated.
This is apparent from the results for C = 0.05 reported in Tables 6 and 7.
SGD-r, SGD-s and SGD-m all appear statistically faster than the linear SVMs.
Also Pegasos outperforms SVM$^{perf}$ for the majority of datasets with preference
for the largest ones. Seemingly, lowering C favors the SGD algorithms.



4 Conclusions 



We reexamined the classical SGD approach to the primal unconstrained L1-SVM
optimization task and made some contributions concerning both theoretical and
practical issues. Assuming a learning rate inversely proportional to the number
of steps a simple change of variable allowed us to simplify the algorithmic
description and demonstrate that in a scheme presenting the patterns to the
algorithm in complete epochs the naturally defined dual variables satisfy
automatically the box constraints of the dual optimization. This opened the way
to obtaining an estimate of the progress made in the optimization process and
enabled the adoption of a meaningful stopping criterion, something the SGD
algorithms were lacking. Moreover, it made possible a qualitative discussion of how
the KKT conditions will be asymptotically satisfied provided the weight vector
$w_t$ gets stabilized. Besides, we showed that in the limit $t \to \infty$ even
without a projection step in the update it holds that
$\|w_t\| \le 1/\sqrt{\lambda}$, a bound known to be obeyed by the optimal
solution. On the more practical side by exploiting our simplified algorithmic
description and employing a mechanism of multiple updating
we succeeded in substantially improving the performance of SGD algorithms.
For optimization tasks of low or medium intensity the algorithms constructed
are comparable to or even faster than the state-of-the-art linear SVMs.
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